1 Background

1.1 Algoritma

The following coursebook is produced by the team at Algoritma for its Data Science Academy workshops. The coursebook is intended for a restricted audience only, i.e. the individuals and organizations having received this coursebook directly from the training organization. It may not be reproduced, distributed, translated or adapted in any form outside these individuals and organizations without permission.

Algoritma is a data science education center with bootcamp programs offered in:

  • Bahasa Indonesia (Jakarta campus)
  • English (Singapore campus)

1.1.1 Lifelong Learning Benefits

If you’re an active student or an alumni member, you also qualify for all our future workshops, 100% free of charge as part of your lifelong learning benefits. It is a new initiative to help you gain mastery and advance your knowledge in the field of data visualization, machine learning, computer vision, natural language processing (NLP) and other sub-fields of data science. All workshops conducted by us (from 1-day to 5-day series) are available to you free-of-charge, and the benefits never expire.

1.1.2 Second Edition

This coursebook is initially written in 2017.

This is the second edition, written in late August 2020. Some of the code has been refactored to work with the latest major version of R, version 4.0. I would like to thank the incredible instructor team at Algoritma for their thorough input and assistance in the authoring and reviewing process.

1.2 Libraries and Setup

We’ll set-up caching for this notebook given how computationally expensive some of the code we will write can get.

#knitr::opts_chunk$set(cache=TRUE)
options(scipen = 9999)
rm(list=ls())

You will need to use install.packages() to install any packages that are not already downloaded onto your machine. You then load the package into your workspace using the library() function:

#install.packages("plotly")
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.3
## Registered S3 methods overwritten by 'tibble':
##   method     from  
##   format.tbl pillar
##   print.tbl  pillar
library(ggpubr)
library(plotly)
## Warning: package 'plotly' was built under R version 4.0.5
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(reshape2)

The Interactive Plotting course, unlike other modules in this Specialization, doesn’t follow a linear learning-path; You’ll be asked to create interactive applications, documents, and a web application throughout the span of this course.

2 Interactive Visualization

As data grow in complexity and size, often times the designer is tasked with the difficult task of balancing overarching storytelling with specificity in their narrative. The designer is also tasked with striking a fine balance between coverage and details under the all-too-real constraints of static graphs and plots.

Interactive visualization is a mean of overcoming these constraints. Quoting from the author of superheat Rebecca Barter, “Interactivity allows the viewer to engage with your data in ways impossible by static graphs. With an interactive plot, viewers can zoom into areas they care about, highlight data points that are relevant to them and hide the information that isn’t.”

More than just interactive visualization, we’ll also learn in this course how to make full-fledged interactive documents, interactive dashboards, and as a bonus, how to create multi-paged PDF documents with the ideal layout of our plots.

I’ll start by introducing plotly.

3 Plotly

Plotly is an interactive, browser-based graphing library that helps data analysts create interactive, high-quality graphs in one of the many supported languages.

Building on what we’ve learned in our last workshop, we’ll learn how to add some nice enhancements and interactivity to our plots using plotly. This works entirelly locally and through the HTML widgets framework, allowing you to create interactive plots directly within RStudio.

We’ll read our data in and perform the (hopefully by now) standard preprocessing procedure:

vids <- read.csv("data_input/youtubetrends.csv")
vids$likesratio <- vids$likes/vids$views
vids$dislikesratio <- vids$dislikes/vids$views

Recall that our videos can take one of the 16 possible video categories. We’ve primarily been working with the News and Media category in our last workshop, so for a change of scenery we’ll be using the videos in the Comedy category for most of our examples.

table(vids$category_id)
## 
##     Autos and Vehicles                 Comedy              Education 
##                     41                    273                    107 
##          Entertainment     Film and Animation                 Gaming 
##                    736                    152                     30 
##        Howto and Style                  Music      News and Politics 
##                    285                    391                    271 
## Nonprofit and Activism       People and Blogs       Pets and Animals 
##                      8                    228                     66 
## Science and Technology                  Shows                 Sports 
##                    175                      1                    188 
##      Travel and Events 
##                     34

Also recalled how we created our custom theme together in the last workshop using theme. Because you’re studying at Algoritma, we’ll save our theme as theme_algoritma:

theme_algoritma <- theme(
  legend.key = element_rect(fill="black"),
  legend.background = element_rect(color="white", fill="#263238"),
  plot.subtitle = element_text(size=6, color="white"),
  panel.background = element_rect(fill="#dddddd"),
  panel.border = element_rect(fill=NA),
  panel.grid.minor.x = element_blank(),
  panel.grid.major.x = element_blank(),
  panel.grid.major.y = element_line(color="darkgrey", linetype=2),
  panel.grid.minor.y = element_blank(),
  plot.background = element_rect(fill="#263238"),
  text = element_text(color="white"),
  axis.text = element_text(color="white")
)

Try and spend a couple of minutes on the code above and have a conceptual understanding of what each line does. This should not be too foreign to you by now! We’ll apply this theme a lot in subsequent ggplot graphics and feel free to revisit this chunk and make any aesthetic adjustments to your liking.

In the past, we’ve relied on R’s base functionality for data preparation, I want to show you a technique that may greatly increase your productivity when working with R. This technique is developed as “a grammar of data manipulation”, and works by providing a consistent set of “verbs” that help you solve the most common data manipulation challenges:

  • mutate() adds new variable
vids <- mutate(vids, likeability = likes/dislikes)
  • select() keeps only the variables we mentioned
channels <- select(vids, c(channel_title, category_id))
tail(channels)
  • filter() returns only the rows based on conditions
filter(vids, views>=25000000)
##   trending_date                                                   title
## 1    2017-11-14             Ed Sheeran - Perfect (Official Music Video)
## 2    2017-11-30 Marvel Studios' Avengers: Infinity War Official Trailer
##          channel_title   category_id        publish_time    views   likes
## 1           Ed Sheeran         Music 2017-11-09 06:04:14 33523622 1634124
## 2 Marvel Entertainment Entertainment 2017-11-29 08:26:24 37736281 1735895
##   dislikes comment_count comments_disabled ratings_disabled
## 1    21082         85067             FALSE            FALSE
## 2    21969        241237             FALSE            FALSE
##   video_error_or_removed publish_hour publish_when publish_wday timetotrend
## 1                  FALSE            6  12am to 8am     Thursday           5
## 2                  FALSE            8   8am to 3pm    Wednesday           1
##   likesratio dislikesratio likeability
## 1 0.04874545  0.0006288700    77.51276
## 2 0.04600069  0.0005821718    79.01566
  • summarise() returns a summary statistics (min, length, mean etc)

Each of these verbs also work with group_by() which allows us to perform any operation “by group”. I’ve attached a full copy of the dplyr cheatsheet in your directory.

A common operation with dplyr is to use group_by and summarise to get a new summary dataframe. This combines nicely with any additional verbs we add to it, through a chaining (%>%) operator. Sounds a little abstract, so let’s dive into an example:

library(dplyr)
v.favor <- vids %>% 
  group_by(category_id) %>%
  summarise(likeratio = mean(likes/views), 
            dlikeratio = mean(dislikes/views)
            ) %>%
  mutate(favor = likeratio/dlikeratio)

We should take a look at the first few rows of the data we just created:

v.favor %>% head()
## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 6 x 4
##   category_id        likeratio dlikeratio favor
##   <chr>                  <dbl>      <dbl> <dbl>
## 1 Autos and Vehicles    0.0176    0.00103  17.2
## 2 Comedy                0.0522    0.00138  37.7
## 3 Education             0.0437    0.00143  30.5
## 4 Entertainment         0.0341    0.00175  19.5
## 5 Film and Animation    0.0338    0.00152  22.3
## 6 Gaming                0.0480    0.00208  23.1

Now using the v.favor dataframe we created and the theme_algoritma we wrote in our last workshop, let’s build a ggplot:

colp <- ggplot(v.favor, aes(x=favor, y=category_id))+
  geom_col(fill="dodgerblue4")+
  labs(title="Favorability Index by Video Category, 2018")+
  theme_algoritma
colp

A simple but pleasant looking bar plot. Adding interactivity using plotly is as simple as wrapping our ggplot object into the ggplotly function:

ggplotly(colp)

As you hover your mouse over, notice the tooltip that shows you the value of our “favorability” index by each category.

Updated: If ggplotly glitches out or return an incorrect-looking plot, try and install ggplot2 from Hadley’s github repo using devtools::install_github('hadley/ggplot2') and restart your R session

Let’s see another example of using the dplyr grammar. Supposed we like to create a summary table that counts the number of appearance each “Comedy” channel has made in that period of trending videos, we could have written the following:

comedy <- vids[vids$category_id == "Comedy", ]
comedy <- aggregate(trending_date ~ channel_title, comedy, length)
comedy <- comedy[order(comedy$trending_date, decreasing=T), ]
names(comedy) <- c("channel_title", "count")
head(comedy)
##                             channel_title count
## 91 The Tonight Show Starring Jimmy Fallon    30
## 60            Late Night with Seth Meyers    21
## 25                           CollegeHumor    15
## 44                         IISuperwomanII    12
## 80                              RM Videos    10
## 34                   ExplosmEntertainment     9

Quiz 1: Using dplyr

Could we have done it easier with dplyr? Refer back to the earlier code chunk and the cheatsheet in your folder to see if you can rewrite the code in dplyr.

From this point on, I’ll leave the creative decision up to you - write R whichever way you prefer! For the most part, to keep the course materials relatively beginner-friendly I’ll use the base R method but where it greatly simplify things, but I may use dplyr in future courses and will expect you to understand them.

Now let’s create a second ggplot object, I’ll name it hexp:

hexp <- ggplot(vids[vids$category_id == "Comedy",], 
               aes(x=likesratio, y=dislikesratio)) +
  geom_point(aes(size=views), alpha=0.5, show.legend = F) + 
  labs(title="Likes vs Dislikes in Trending Comedy Videos", subtitle="Visualizing likes vs dislikes in the Algoritma theme, source: YouTube") +
  theme_algoritma
hexp

Wrapping hexp in our ggplotly() function yields an interactive HTML widget:

ggplotly(hexp)

Plotly works with time series data as well. To show you an illustration of this, I’ll read the economics dataset that ships with ggplot. economics_long is a US economic time series and I’ll use the first three columns of it:

el <- as.data.frame(economics_long[,1:3])

Creating a facet_grid ggplot object with varying y-scales on each of the grid:

econp <- ggplot(el, aes(date, value, group=variable)) + 
  geom_line()+
  facet_grid(variable ~ ., scale = "free_y")+
  labs(title="US Economic time series")+
  theme_algoritma
econp

Creating our ggplotly object to add interactivity in our plots:

ggplotly(econp)

Because it is a plotly object, you can also use supporting plotly functions such as rangeslider() to add a range slider to the x-axis.

rangeslider(ggplotly(econp))

Before you proceed, drag across the range slider and see how your plot reacts to the different values on the range slider.

Now let’s take it further with our plotly experimentation. First, we’ll create a long format data frame containing videos that have comment enabled:

# 4,8,7,9 pointing to category_id, likes, dislikes, comment_count
vids.m <- vids[vids$comments_disabled == F,c(4,7,8,9)]

library(tidyr)
vids.m <- pivot_longer(data = vids.m,
                       cols = -category_id,
                       names_to = "variable",
                       values_to = "value")

As we create our ggplot, then wrap it in ggplotly() as we’ve been doing above:

cplot <- ggplot(vids.m, aes(x=value, y=category_id))+
  # position can also be stack
  geom_col(position="dodge", aes(fill=variable))
ggplotly(cplot)

Quiz 2: Hands-on Plotly Observe the many different functionalities of plotly by playing around with the icon bar in the widget. Try and do each of the following at least once: - Switch from “Show closest data on hover” to “Compare data on hover”
- Toggle Spike Lines
- Click on Legend items to toggle visibility

As a bonus exercise, try and create your own unique plotly starting from the raw data (vids). You are free to use any subsetting and pick any plot type - but the end result have to be a plotly object created using the ggplotly() function. When you’re done, we’ll move onto the next chapter!

4 Publication and Layout Options

We’ll now going to create a multi-page PDF containing all the plots we’ve created so far. To give our publication a consistent style, let’s apply our theme_algoritma to the last plot we created:

cplot <- cplot + theme_algoritma
cplot

Through the ggpubr package, we’ll use the ggarrange to put the 4 plots we created in earlier steps together into a list. Because we specify nrow=2, we imagine that the resulting would be a list of 2 objects, each containing 2 rows (one for each plot):

publicat <- ggarrange(hexp, econp, cplot, colp, nrow=2)

Let’s take a look at the first item on our publicat list:

publicat[[1]]

As well as the second:

publicat[[2]]

Once we’re happy with the result, we can use ggexport() and specify a file name. This will export the list as a multi-page PDF:

ggexport(publicat, filename="assets/publication.pdf")
## file saved to assets/publication.pdf

To visualize interactively, just print publicat from your console or document.

Similar to ggarrange(), plotly allow us to put different plots together into one plotly object using the subplot() function. In this plot, there are 4 subplots, and interacting with any one of them will cause the other subplots to react accordingly to your input:

subplot(
  cplot,
  hexp, 
  colp,
  econp,
  nrows=4)

To see another example, I’m going to go ahead and create 4 ggplots:

hexp <- ggplot(vids[vids$category_id == "Comedy",], 
               aes(x=likesratio, y=dislikesratio)) +
  geom_point(aes(size=views), alpha=0.5, show.legend = F) +
  labs(title="Likes vs Dislikes in Trending Comedy Videos", subtitle="Visualizing likes vs dislikes in the Algoritma theme, source: YouTube") +
  theme_algoritma
hexp

hexp2 <- ggplot(vids[vids$category_id == "Comedy",], aes(x=likesratio, y=dislikesratio))+
  geom_hex(alpha=0.6, show.legend = F)+
  labs(title="Likes vs Dislikes in Trending Comedy Videos", subtitle="Visualizing likes vs dislikes in the Algoritma theme, source: YouTube")+
  theme_algoritma
hexp2

hexp3 <- ggplot(vids[vids$category_id == "Comedy",], aes(x=likesratio, y=dislikesratio))+
  geom_line(col="black", show.legend = FALSE)+
  labs(title="Likes vs Dislikes in Trending Comedy Videos", subtitle="Visualizing likes vs dislikes in the Algoritma theme, source: YouTube")+
  theme_algoritma
hexp3

hexp4 <- ggplot(vids[vids$category_id == "Comedy",], aes(x=likesratio, y=dislikesratio))+
  geom_bin2d(show.legend=FALSE)+
  labs(title="Likes vs Dislikes in Trending Comedy Videos", subtitle="Visualizing likes vs dislikes in the Algoritma theme, source: YouTube")+
  theme_algoritma
hexp4

And use subplot() to arrange them together into one plotly widget with the specified widths.

subplot(
hexp, hexp2, hexp3, hexp4,
  nrows=2, shareX=T, shareY=T, widths=c(0.65, 0.35))

Notice that as we use the interactive selection tools or zoom in on any part of the plot (either plot) the other plots will refresh accordingly - a pretty neat feature considering how simple its set up is!

Note that the common title automatically takes the last plot’s title, so in this case the common (shared) title inherits from hexp4. As of its current development cycle, plotly does not support titles or any similar functionalities yet so adding a subplot title or even a mutual title is a bit hackerish (using annotate()1) and beyond the scope of this coursebook. As and when this change in a future release / update, I will update this coursebook accordingly to include examples.

5 Flex Dashboard

Flex Dashboard is an R package that “easily create flexible, attractive, interactive dashboards with R”. Authoring and customization of dashboards is done using R Markdown with the flexdashboard::flex_dashboard output format. To get started, install flexdashboard using the standard installation process you should be familiar by now:
install.packages(flexdashboard)

When that is done, create a new R Markdown document from within RStudio, choose “From Template” and then Flex Dashboard as following:

The template code that was generated for you takes some default value - for example it chooses to have a columns orientation and set your layout to fill.

If you like your plots to change in height so as to fill the web page vertically, the vertical_layout: fill (default) setting should be kept. If you want the charts to maintain their original height instead, this makes it necessary to have page scrolling in order to accommodate all your plots. That can be done by setting vertical_layout to a scrolling layout using scroll.

Within each of the code chunk of the Rmd template code that was generated for you, you will find it common to enter:
- R graphical output (plot(), ggplot()) - Interactive JavaScript data visualization based on htmlwidgets (plotly)
- Tabular data (table())
- Common summary data, text, values etc

6 Summary

Congratulations on getting started with making interactive plots using plotly and flexdashboard - in the remaining sessions of this workshop we’ll look at creating an interactive document that allow the end user to interact with our creation and we’ll publish our project onto the web in the learn-by-building module.

I hope you’re starting to feel more accomplished from the earlier days when we are all learning the ropes in our first few session. As always, the secret to fluency is practice!

Happy coding!

Samuel

7 Reference Answer

comedy2 <- vids %>% 
  filter(category_id == "Comedy") %>%
  group_by(channel_title) %>%
  summarise(count = n()) %>%
  arrange(desc(count))